Data Science
This page contains resources about Data Science, including Data Engineering.

Subfields and Concepts
* Machine Learning / Data Mining
* Exploratory Data Analysis
* Data Preparation and Preprocessing
* High Performance/Parallel/Distributed Computing for Machine Learning
* Concurrent/Multi-threading Computing for Machine Learning
* Data Engineering and Databases
* Data Visualization
* Big Data

Online courses

Video Lectures
* How to Win a Data Science Competition: Learn from Top Kagglers - Coursera
* Exploratory data analysis in Python by Chloe Mawer and Jonathan Whitmore - PyCon 2017

Lecture Notes
* What is Data Science by Ioannis Kourouklides
* When to Use and When Not to Use Distributed Machine Learning by Chih-Jen Lin
* Open Machine Learning Course (Medium)
* Mining Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeff Ullman
* Hardware Acceleration for Data Processing by Gustavo Alonso
* CS109: Data Science

Books
* Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.
* Schutt, R., & O'Neil, C. (2013). Doing data science: Straight talk from the frontline. O'Reilly Media.
* Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of massive datasets. Cambridge University Press.
* Zumel, N., Mount, J., & Porzak, J. (2014). Practical data science with R. Manning.
* Nolan, D., & Lang, D. T. (2015). Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. CRC Press.
* Elston, S. F. (2015). Data Science in the Cloud with Microsoft Azure Machine Learning and R. O'Reilly Media, Inc.
* Grus, J. (2015). Data Science from Scratch: First Principles with Python. O'Reilly Media.
* Madhavan, S. (2015). Mastering Python for Data Science. Packt Publishing Ltd.
* Blum, A., Hopcroft, J., & Kannan, R. (2015). Foundations of Data Science.
* VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media.
* Wickham, H., & Grolemund, G. (2017). R for Data Science. O'Reilly Media.

Scholarly Articles
* Xing, E. P., Ho, Q., Xie, P., & Wei, D. (2016). Strategies and principles of distributed machine learning on big data. Engineering, 2(2), 179-195.
* Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J. Z. (2016). Big data analytics on Apache Spark. International Journal of Data Science and Analytics, 1(3-4), 145-164.
* Huang, Y., Zhu, F., Yuan, M., Deng, K., Li, Y., Ni, B., ... & Zeng, J. (2015). Telco Churn Prediction with Big Data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 607-618). ACM.
* Moritz, P., Nishihara, R., Stoica, I., & Jordan, M. I. (2015). SparkNet: Training Deep Networks in Spark. arXiv preprint arXiv:1511.06051.

Software
* Docker (Containers)
* Anaconda Distribution - Python
* Beautiful Soup 4 - Python
* ray - Python
* multiprocessing - Python
* threading - Python
* auto_ml - Python
* Elasticsearch, Logstash, Kibana (ELK)
* MongoDB
* Apache Solr
* Apache Hadoop
* Apache HBase
* Apache Spark
* Apache Hive
* Apache Kafka, which includes Kafka Connect
* Apache Cassandra
* Apache ZooKeeper
* Apache Pig
* Apache Storm
* Apache CouchDB
* Apache ActiveMQ
* RabbitMQ
* pyspark - Spark Python API
* tensorflow_scala - Scala API for TensorFlow
* TensorFlowSharp - TensorFlow API for .NET languages
* TensorFlowOnSpark - It brings TensorFlow programs onto Apache Spark clusters
* Numba - Python

See also
* Algorithms

Other Resources
* Data Science Guide
* Data Science Engineering, your way
* Large Scale Machine Learning - libraries and papers
* What are some courses on large scale learning? - Quora
* 7 Steps to Mastering Data Preparation with Python - blog post
* Web Scraping for Data Science with Python - blog post
* Intro to Distributed Deep Learning Systems - blog post
* Princeton Commodities Modeling Blog
* Exploratory data analysis using Python for used car database taken from Kaggle - GitHub
* Detailed exploratory data analysis with Python - Kaggle
* Python-camp - GitHub
* Big Data: Spark, Hadoop, Hive, ZooKeeper, Solr, Kafka, Nutch, MongoDB, ... - installation instructions
* Deep Learning with Apache Spark and TensorFlow - blog post
* Build a Simple Chatbot with Tensorflow, Python and MongoDB - blog post
* Visual Data Analysis with Python - blog post
* Exploratory Data Analysis with Pandas - blog post
* Plotly Python Library Maps
* 5 Quick and Easy Data Visualizations in Python with Code - blog post
* William Koehrsen - blog
* ClaoudML - Free Data Science & Machine Learning Resources
* Parallel and Distributed Deep Learning by Tal Ben-Nun
* An introduction to parallel programming using Python's multiprocessing module - blog post
* Putting Machine Learning Models into Production - blog post
* Spark + Deep Learning: Distributed Deep Neural Network Training with SparkNet - blog post
* Data Science in Python: Pandas Cheat Sheet
* Simple Exploratory Data Analysis - PASSNYC - Kaggle
* EDA and Clustering - Kaggle
* Time Series Anomaly Detection: Optimizing your Machine Learning Jobs in Elasticsearch - webinar
* Web Access Logs in Elasticsearch and Machine Learning - webinar
* Deploying Python models to production - video
* Deploying Machine Learning apps with Docker containers - MUPy 2017 - video
* How to deploy machine learning models into production - video
* Federated Learning: Collaborative Machine Learning without Centralized Training Data - blog post
* Learn to Build Machine Learning Services, Prototype Real Applications, and Deploy your Work to Users - blog post
* Deploying Keras Deep Learning Models with Flask - blog post
* Introducing Flask-RESTful - blog post
* Getting started with Anaconda & Docker - blog post
* Docker for Data Science - blog post
* How Docker Can Help You Become A More Effective Data Scientist - blog post
* Deep Learning Installation Tutorial - Part 4: How to install Docker for Deep Learning - blog post
* pyspark (GitHub) - collection of resources
* Distributed-TensorFlow-Guide (GitHub) - Distributed TensorFlow basics and examples of training algorithms (with code)
* kafka-streams-machine-learning-examples (GitHub) - Machine Learning + Kafka Streams Examples (with code)
* How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka - blog post
* Realtime Machine Learning predictions with Kafka and H2O.ai - blog post
* Deploying deep learning models: Part 1 an overview - blog post
* A guide to deploying Machine/Deep Learning model(s) in Production - blog post
* How redBus uses Scikit-Learn ML models to classify customer complaints? - blog post
* Cheat Sheets for AI, Neural Networks, Machine Learning, Deep Learning & Big Data - blog post
* How to write a production-level code in Data Science? - blog post